Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively

Neural Information Processing Systems

Large-scale pre-trained language models have recently achieved impressive results on a wide range of downstream tasks. However, fine-tuning an extremely large-scale pre-trained language model on limited target datasets is often plagued by overfitting and representation degradation. In this paper, we propose a Dynamic Parameter Selection (DPS) algorithm for large-scale pre-trained models during fine-tuning, which adaptively selects a more promising subnetwork to perform staged updates based on the gradients from back-propagation. Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability, and consistently achieves better results across various pre-trained language models. In addition, DPS brings large improvements in out-of-domain transfer experiments and low-resource scenarios, which shows that it can maintain stable general contextual features and mitigate representation collapse.
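
The core idea, selecting which parameters to update based on the gradients of back-propagation, can be illustrated with a short sketch. The following is a minimal PyTorch illustration assuming a hypothetical keep-ratio hyperparameter `rho`; it is not the authors' DPS implementation (which performs selection in stages), only the gradient-magnitude selection idea.

```python
import torch

def masked_update_step(model, loss, optimizer, rho=0.3):
    """One update step that keeps only the top-`rho` fraction of each
    parameter tensor's entries (ranked by gradient magnitude) and zeroes
    the remaining gradients before the optimizer step, so only the
    selected subnetwork is updated.

    NOTE: `rho` is a hypothetical hyperparameter introduced for this
    sketch, not a name from the paper.
    """
    optimizer.zero_grad()
    loss.backward()
    for p in model.parameters():
        if p.grad is None:
            continue
        g = p.grad.abs().flatten()
        k = max(1, int(rho * g.numel()))
        # Smallest gradient magnitude that still makes the top-k cut.
        threshold = torch.topk(g, k).values.min()
        # Zero out gradients of the non-selected entries.
        p.grad.mul_((p.grad.abs() >= threshold).to(p.grad.dtype))
    optimizer.step()
```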


Appendix for "Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively"

Neural Information Processing Systems

In Sec. 3.3, we experimentally verified that DPS outperforms various fine-tuning methods. Table 1 lists the eight datasets used in this paper, drawn from the GLUE benchmark. We investigate the performance of DPS on five distinctive and widely used large-scale pre-trained language models, including BERT [Devlin et al., 2018] and RoBERTa [Liu et al., 2019]; DeBERTa improves on Transformer-based pre-trained models with a disentangled attention mechanism and an enhanced mask decoder. We use mixed-precision training to speed up the experimental process; this method is also applied by ELECTRA when fine-tuning on downstream tasks.

Appendix D. Experimental Details for Different Fine-tuning Methods

The following is our hyperparameter search space for the different fine-tuning regularization methods. Mixout: we grid-search the Mixout probability p ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8}, as sketched below.
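
For concreteness, the Mixout search is a plain grid-search loop over the eight probabilities above. This is a minimal sketch: `finetune_and_eval` is a hypothetical helper (not from the paper) standing in for a full fine-tuning run with Mixout regularization at probability p followed by dev-set evaluation.

```python
# Sketch of the Mixout hyperparameter grid search.
# NOTE: `finetune_and_eval` is a hypothetical stand-in that fine-tunes
# with Mixout probability `mixout_p` and returns a dev-set metric.
best_p, best_score = None, float("-inf")
for p in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    score = finetune_and_eval(mixout_p=p)
    if score > best_score:
        best_p, best_score = p, score
print(f"best Mixout p = {best_p} (dev score = {best_score:.4f})")
```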

